response quality
AI-Assisted Conversational Interviewing: Effects on Data Quality and Respondent Experience
Barari, Soubhik, Angbazo, Jarret, Wang, Natalie, Christian, Leah M., Dean, Elizabeth, Slowinski, Zoe, Sepulvado, Brandon
Standardized surveys scale efficiently but sacrifice depth, while conversational interviews improve response quality at the cost of scalability and consistency. This study bridges the gap between these methods by introdu cing a framework for AI - assisted conversational interviewing. To evaluate this framework, we conducted a web survey experiment where 1,800 p articipants were randomly assigned to AI ' chatbots ' which use large language models (LLMs) to dynamically probe respondents for elaboration and interactively code open - ended responses to fixed questions developed by human researchers . We assessed the AI chatbot's performance in terms of coding accuracy, response quality, and respondent experience. Our findings reveal that AI chatbots perform moderately well in live coding even without survey - specific fine - tuning, despite slightly inflated false positive err ors due to respondent acquiescence bias. Open - ended responses were more detailed and informative, but this came at a slight cost to respondent experience. Our findings highlight the feasibility of using AI methods such as chatbots enhanced by LLMs to enhance open - ended data collection in web surveys. 2
- Asia > Middle East > Jordan (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- (2 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Personal > Interview (1.00)
- Leisure & Entertainment (0.92)
- Government > Regional Government (0.67)
- Banking & Finance (0.67)
- Media > News (0.46)
EduMod-LLM: A Modular Approach for Designing Flexible and Transparent Educational Assistants
Mittal, Meenakshi, Khare, Rishi, Miroyan, Mihran, Mitra, Chancharik, Norouzi, Narges
With the growing use of Large Language Model (LLM)- based Question-Answering (QA) systems in education, it is critical to evaluate their performance across individual pipeline components. In this work, we introduce EduMod-LLM, a modular function-calling LLM pipeline, and present a comprehensive evaluation along three key axes: function calling strategies, retrieval methods, and generative language models. Our framework enables fine-grained analysis by isolating and assessing each component. We benchmark function-calling performance across LLMs, compare our novel structure-aware retrieval method to vector-based and LLM-scoring baselines, and evaluate various LLMs for response synthesis. This modular approach reveals specific failure modes and performance patterns, supporting the development of interpretable and effective educational QA systems. Our findings demonstrate the value of modular function calling in improving system transparency and pedagogical alignment.
- North America > United States > California > Alameda County > Berkeley (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Research Report > New Finding (1.00)
- Instructional Material > Course Syllabus & Notes (1.00)
- Information Technology (0.93)
- Education > Curriculum (0.68)
- Education > Educational Setting > Higher Education (0.46)
HierRouter: Coordinated Routing of Specialized Large Language Models via Reinforcement Learning
Gupta, Nikunj, Guo, Bill, Kannan, Rajgopal, Prasanna, Viktor K.
Large Language Models (LLMs) deliver state-of-the-art performance across many tasks but impose high computational and memory costs, limiting their deployment in resource-constrained or real-time settings. To address this, we propose HierRouter, a hierarchical routing approach that dynamically assembles inference pipelines from a pool of specialized, lightweight language models. Formulated as a finite-horizon Markov Decision Process (MDP), our approach trains a Proximal Policy Optimization (PPO)-based reinforcement learning agent to iteratively select which models to invoke at each stage of multi-hop inference. The agent conditions on the evolving context and accumulated cost to make context-aware routing decisions. Experiments with three open-source candidate LLMs across six benchmarks, including QA, code generation, and mathematical reasoning, show that HierRouter improves response quality by up to 2.4x compared to using individual models independently, while incurring only a minimal additional inference cost on average. These results highlight the promise of hierarchical routing for cost-efficient, high-performance LLM inference. All codes can be found here https://github.com/ Nikunj-Gupta/hierouter.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)
Conversational Intent-Driven GraphRAG: Enhancing Multi-Turn Dialogue Systems through Adaptive Dual-Retrieval of Flow Patterns and Context Semantics
Zhu, Ziqi, Hu, Tao, Zhang, Honglong, Yang, Dan, Chen, HanGeng, Zhang, Mengran, Chen, Xilun
We present CID-GraphRAG (Conversational Intent-Driven Graph Retrieval Augmented Generation), a novel framework that addresses the limitations of existing dialogue systems in maintaining both contextual coherence and goal-oriented progression in multi-turn customer service conversations. Unlike traditional RAG systems that rely solely on semantic similarity (Conversation RAG) or standard knowledge graphs (GraphRAG), CID-GraphRAG constructs dynamic intent transition graphs from goal achieved historical dialogues and implements a dual-retrieval mechanism that adaptively balances intent-based graph traversal with semantic search. This approach enables the system to simultaneously leverage both conversional intent flow patterns and contextual semantics, significantly improving retrieval quality and response quality. In extensive experiments on real-world customer service dialogues, we employ both automatic metrics and LLM-as-judge assessments, demonstrating that CID-GraphRAG significantly outperforms both semantic-based Conversation RAG and intent-based GraphRAG baselines across all evaluation criteria. Quantitatively, CID-GraphRAG demonstrates substantial improvements over Conversation RAG across automatic metrics, with relative gains of 11% in BLEU, 5% in ROUGE-L, 6% in METEOR, and most notably, a 58% improvement in response quality according to LLM-as-judge evaluations. These results demonstrate that the integration of intent transition structures with semantic retrieval creates a synergistic effect that neither approach achieves independently, establishing CID-GraphRAG as an effective framework for addressing the challenges of maintaining contextual coherence and goal-oriented progression in knowledge-intensive multi-turn dialogues.
MM-CRITIC: A Holistic Evaluation of Large Multimodal Models as Multimodal Critique
Zeng, Gailun, Luo, Ziyang, Lin, Hongzhan, Tian, Yuchen, Li, Kaixin, Gong, Ziyang, Guo, Jianxiong, Ma, Jing
The ability of critique is vital for models to self-improve and serve as reliable AI assistants. While extensively studied in language-only settings, multimodal critique of Large Multimodal Models (LMMs) remains underexplored despite their growing capabilities in tasks like captioning and visual reasoning. In this work, we introduce MM-CRITIC, a holistic benchmark for evaluating the critique ability of LMMs across multiple dimensions: basic, correction, and comparison. Covering 8 main task types and over 500 tasks, MM-CRITIC collects responses from various LMMs with different model sizes and is composed of 4471 samples. To enhance the evaluation reliability, we integrate expert-informed ground answers into scoring rubrics that guide GPT-4o in annotating responses and generating reference critiques, which serve as anchors for trustworthy judgments. Extensive experiments validate the effectiveness of MM-CRITIC and provide a comprehensive assessment of leading LMMs' critique capabilities under multiple dimensions. Further analysis reveals some key insights, including the correlation between response quality and critique, and varying critique difficulty across evaluation dimensions. Our code is available at https://github.com/MichealZeng0420/MM-Critic.
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages
Whitehouse, Chenxi, Ruder, Sebastian, Lin, Tony, Kurylo, Oksana, Takagi, Haruka, Lam, Janice, Busetto, Nicolò, Diaz, Denise, Guzmán, Francisco
Ensuring native-like quality of large language model (LLM) responses across many languages is challenging. Our evaluation reveals that zero-shot LLM judges benefit significantly from pairwise evaluation and our structured annotation rubrics, yet they still underperform human annotators on our dataset. We demonstrate substantial improvements through fine-tuning with reinforcement learning, reward shaping, and multi-task learning approaches. Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain. Our findings suggest promising directions for scalable multilingual evaluation and preference alignment. We release our dataset and evaluation framework to support further research in multilingual LLM evaluation.Dataset https://huggingface.co/datasets/facebook/menlo In order for LLMs to be most useful across the globe, they need to be able to provide high-quality responses in many languages. Responses should be relevant (Zhuang et al., 2024), factually accurate (Jacovi et al., 2025), and natural (Marchisio et al., 2024; Guo et al., 2025), among other considerations. Ultimately, for interaction in any language to be seamless, responses need to be indistinguishable from those of a native speaker (Novikova et al., 2016; Liu et al., 2021). Language proficiency in humans has traditionally been evaluated via standardized tests (Jamieson et al., 2000). While such tests have been applied to evaluating LLMs (Anil et al., 2023; Mayor-Rocher et al., 2024; Lothritz & Cabot, 2025), they are difficult to scale and do not readily correspond to real-world conversations.
- Europe > Austria > Vienna (0.14)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Asia > Indonesia > Bali (0.04)
- (20 more...)
Federated Attention: A Distributed Paradigm for Collaborative LLM Inference over Edge Networks
Deng, Xiumei, Xiong, Zehui, Chen, Binbin, Kim, Dong In, Debbah, Merouane, Poor, H. Vincent
Large language models (LLMs) are proliferating rapidly at the edge, delivering intelligent capabilities across diverse application scenarios. However, their practical deployment in collaborative scenarios confronts fundamental challenges: privacy vulnerabilities, communication overhead, and computational bottlenecks. To address these, we propose Federated Attention (FedAttn), which integrates the federated paradigm into the self-attention mechanism, creating a new distributed LLM inference framework that simultaneously achieves privacy protection, communication efficiency, and computational efficiency. FedAttn enables participants to perform local self-attention over their own token representations while periodically exchanging and aggregating Key-Value (KV) matrices across multiple Transformer blocks, collaboratively generating LLM responses without exposing private prompts. Further, we identify a structural duality between contextual representation refinement in FedAttn and parameter optimization in FL across private data, local computation, and global aggregation. This key insight provides a principled foundation for systematically porting federated optimization techniques to collaborative LLM inference. Building on this framework, we theoretically analyze how local self-attention computation within participants and heterogeneous token relevance among participants shape error propagation dynamics across Transformer blocks. Moreover, we characterize the fundamental trade-off between response quality and communication/computation efficiency, which is governed by the synchronization interval and the number of participants. Experimental results validate our theoretical analysis, and reveal significant optimization opportunities through sparse attention and adaptive KV aggregation, highlighting FedAttn's potential to deliver scalability and efficiency in real-world edge deployments.
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > California > Santa Clara County > Santa Clara (0.04)
- (16 more...)
- Information Technology > Security & Privacy (1.00)
- Law (0.93)
BEACON: Bayesian Optimal Stopping for Efficient LLM Sampling
Wan, Guangya, Xu, Zixin Stephen, Zorc, Sasa, Baucells, Manel, Hu, Mengxuan, Wang, Hao, Li, Sheng
Sampling multiple responses is a common way to improve LLM output quality, but it comes at the cost of additional computation. The key challenge is deciding when to stop generating new samples to balance accuracy gains against efficiency. To address this, we introduce BEACON (Bayesian Efficient Adaptive Criterion for Optimal N-stopping), a principled adaptive sampling framework grounded in Sequential Search with Bayesian Learning. BEACON sequentially generates responses from the policy LLM, updates posterior belief over reward distributions in real time without further training, and determines when to stop by weighing expected gains against computational cost. Sampling terminates once the marginal utility of further exploration no longer justifies the expense. We establish both theoretical optimality guarantees and practical tractability, and show empirically that BEACON reduces average sampling by up to 80% while maintaining response quality. We further demonstrate BEACON's utility for cost-efficient preference data generation and outline practical extensions, offering actionable insights for future researchers.
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- North America > United States > Virginia (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.66)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.48)
IPR: Intelligent Prompt Routing with User-Controlled Quality-Cost Trade-offs
Feng, Aosong, Srinivasan, Balasubramaniam, Zhou, Yun, Xu, Zhichao, Zhou, Kang, Guan, Sheng, Chen, Yueyan, Wu, Xian, Kulkarni, Ninad, Zhang, Yi, Shen, Zhengyuan, Bespalov, Dmitriy, Mishra, Soumya Smruti, Teng, Yifei, Wang, Darren Yow-Bang, Ding, Haibo, Cheong, Lin Lee
Routing incoming queries to the most cost-effective LLM while maintaining response quality poses a fundamental challenge in optimizing performance-cost trade-offs for large-scale commercial systems. We present IPR\, -- \,a quality-constrained \textbf{I}ntelligent \textbf{P}rompt \textbf{R}outing framework that dynamically selects optimal models based on predicted response quality and user-specified tolerance levels. IPR introduces three key innovations: (1) a modular architecture with lightweight quality estimators trained on 1.5M prompts annotated with calibrated quality scores, enabling fine-grained quality prediction across model families; (2) a user-controlled routing mechanism with tolerance parameter $τ\in [0,1]$ that provides explicit control over quality-cost trade-offs; and (3) an extensible design using frozen encoders with model-specific adapters, reducing new model integration from days to hours. To rigorously train and evaluate IPR, we curate an industrial-level dataset IPRBench\footnote{IPRBench will be released upon legal approval.}, a comprehensive benchmark containing 1.5 million examples with response quality annotations across 11 LLM candidates. Deployed on a major cloud platform, IPR achieves 43.9\% cost reduction while maintaining quality parity with the strongest model in the Claude family and processes requests with sub-150ms latency. The deployed system and additional product details are publicly available at https://aws.amazon.com/bedrock/intelligent-prompt-routing/
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
- (3 more...)
FlashResearch: Real-time Agent Orchestration for Efficient Deep Research
Nie, Lunyiu, Lipka, Nedim, Rossi, Ryan A., Chaudhuri, Swarat
Deep research agents, which synthesize information across diverse sources, are significantly constrained by their sequential reasoning processes. To overcome this, we introduce FlashResearch, a novel framework for efficient deep research that transforms sequential processing into parallel, runtime orchestration by dynamically decomposing complex queries into tree-structured sub-tasks. Our core contributions are threefold: (1) an adaptive planner that dynamically allocates computational resources by determining research breadth and depth based on query complexity; (2) a real-time orchestration layer that monitors research progress and prunes redundant paths to reallocate resources and optimize efficiency; and (3) a multi-dimensional parallelization framework that enables concurrency across both research breadth and depth. Experiments show that FlashResearch consistently improves final report quality within fixed time budgets, and can deliver up to a 5 speedup while maintaining comparable quality. Deep research tasks, which involve synthesizing information from diverse sources and navigating complex, interdependent concepts, pose significant challenges for existing AI systems. These tasks often demand knowledge retrieval, advanced reasoning, sophisticated tool use, and dynamic planning over multiple steps under structural uncertainty and evolving objectives (Du et al., 2025). Applications include literature review (Haman & ˇ Skoln ık, 2025), open-domain question answering, and policy analysis (Gambrell, 2025), where the ability to evaluate conflicting perspectives, explore hypotheses, and revise beliefs as new evidence emerges is essential.
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)